Abstract

User relevance feedback is commonly used by Web systems to interpret
users’ information needs and retrieve effective results. However, how
to discover useful knowledge in user relevance feedback and how to use
the discovered knowledge wisely are two critical problems. In TREC
2009, we participated in the Relevance Feedback Track and experimented
with a model consisting of two innovative stages: one for subject-based
query expansion to extract pseudo-relevance feedback, and one for
relevance feature discovery to find useful patterns and terms in
relevance judgements to rank documents. This paper describes the model
in detail and discusses the experimental results.


1  Introduction 

Web users’ personal interests and preferences can be described in
their user profiles. In Web information gathering, many works use user
profiles to search for information according to users’ personal needs
[3, 10]. However, acquiring user profiles effectively is difficult.
Some techniques explicitly interview users [13]; others use user
relevance feedback [14]. These mechanisms require user effort in the
profile acquisition process. To relieve users of this burden, automatic
techniques have been developed that acquire user profiles from a
collection of a user’s personal information, for example, browsing
history [3, 17]. User profiles acquired by such techniques, however,
usually contain noise and uncertainties. Hence, a method that acquires
user profiles effectively and efficiently (without the burden of user
effort) is urgently needed for personalized Web information gathering.


Relevance features describe what a user wants. They can be discovered
from user relevance feedback. Over the years, pattern-based approaches
have been expected to outperform term-based techniques when discovering
relevance features, because patterns are more discriminative and carry
more “semantics”. However, according to information retrieval (IR)
experiments, replacing term-based methods with pattern-based methods
has yielded few significant improvements [15, 16]. Two problems arise
when pattern mining techniques are used: (i) highly frequent patterns
are usually general, whereas specific patterns usually have low
frequency (this is because the measures used for pattern learning, such
as “support” and “confidence”, appeared unsuitable in the filtering
stage [11]); (ii) negative user feedback is difficult to use when
revising the features extracted from positive user feedback. Relevance
feature discovery is therefore challenging [10, 12].

Motivated by these challenges, we proposed a relevance feature
discovery model and tested it in the Relevance Feedback track of TREC
2009. The track was designed to evaluate a system’s capacity for
finding quality user relevance feedback, as well as its relevance
feedback algorithms. Accordingly, the track was conducted in two
phases: (i) identifying a small number of documents for (pseudo)
relevance feedback; (ii) running relevance feedback algorithms with
relevance judgements. In accordance with the two phases, we
participated with a two-stage information filtering model: (i)
subject-based query expansion for pseudo relevance feedback extraction;
(ii) pattern-based relevance feature discovery using both positive and
negative feedback. The model aimed to discover relevance features for
Web user profile acquisition.

The first stage expanded a query (topic) to retrieve pseudo relevance
feedback. To expand queries, we used a subject ontology, LCSH (Library
of Congress Subject Headings). The ontology specifies commonsense
knowledge that people obtain through their experience and education,
and was successfully evaluated in our prior work reported in [18].
Given a query, the topic-related subjects were extracted from the LCSH
ontology. On the basis of these subjects, user background knowledge was
discovered and a personalized ontology was constructed. Based on the
personalized ontology and using an information gathering system, a
training set (consisting of a positive and a negative subset) was
extracted from the ClueWeb09 Category-B corpus by title search, and
treated as pseudo relevance feedback.

In the second stage, relevance features were discovered from both
positive and negative pseudo relevance feedback, using a model
introduced in [9]. These relevance features consisted of high-level
pattern features and low-level term features. Based on the high-level
features, the low-level features were classified into three groups:
positive specific terms, general terms, and negative specific terms.
When applying negative patterns to revise the discovered features, we
increased the weight of positive specific terms and decreased that of
negative specific terms. This feature revision was repeated in a loop
to optimize the relevance feature extraction. Finally, documents highly
relevant to these relevance features were retrieved from ClueWeb09
Category-B as the final submission results.

In this paper, the two-stage model and the related evaluation in the
TREC 2009 Relevance Feedback track are presented and discussed. Section
2 introduces the subject-based query expansion, and Section 3 presents
relevance feature discovery using positive and negative samples. The
evaluation results are then discussed in Section 4. Finally, the last
section concludes the paper.


2  Subject-Based  Query  Expansion  for 
Pseudo  Relevance  Feedback 

The first stage aims to automatically retrieve pseudo relevance
feedback from ClueWeb09 Category-B. Because the given topics contained
only a limited number of terms, the key issue was how to acquire user
interests from this limited information. In this work, we used a world
knowledge ontology to analyze the concepts in the given topics. For an
incoming topic, the positive subjects were extracted from the ontology.
Based on these subjects and the instances they refer to, user
background knowledge was discovered and used to expand the given query
terms and to search ClueWeb09 Category-B for pseudo relevance feedback.
The top five ranked results were considered relevance feedback from
users. Figure 1 illustrates the architecture of our Stage 1 process.


Figure  1.  The  Stage  1  Architecture 


2.1  World  Ontology  and  Instances 

The world ontology was encoded from the Library of Congress Subject
Headings¹ (LCSH), a library catalog system. The LCSH system is a
categorization developed for organizing large volumes of library
collections and for retrieving information from the library. The
references specified in LCSH for subject headings were encoded into the
semantic relations associated with and linking the subjects, where
Broader term/Narrower term references were encoded as is-a relations,
Used-for references as part-of relations, and Related-to references as
related-to relations. The LCSH ontology contained about 400,000
topical, geographical, and corporate subjects.

The LCSH ontology was populated using instances encoded from the
information items in a library catalog². Figure 2 illustrates a sample
information item for instances. The descriptive information of an item,
such as the title and table of contents, is a knowledge resource that
extends the LCSH ontology, and was used as the content of an instance.
Each item (instance) cites a list of indexed content-based descriptors
(subjects). Thus, a matrix can be constructed from instances and
subjects: each instance may cite a list of subjects, and each subject
may refer to a list of instances. Based on this matrix, the belief
($bel$) of an instance $i$ to a subject $s$ can be determined:

$$bel(i, s) = \frac{1}{index(s, i) \times |\eta(i)|},$$

where $\eta(i)$ is the set of subjects cited by $i$, and $index(s, i)$
is the index (starting with one) of $s$ on the citing list. Using the
instance displayed in Fig. 2 as a sample, let $i$ be this instance and
$s$ be the subject Consumption (Economics)-Germany (East). We have
$index(s, i) = 1$ and $|\eta(i)| = 4$, and can thus calculate
$bel(i, s) = 1 / (1 \times 4) = 0.25$. The fewer subjects an instance
cites and the higher a subject is placed on the citing list, the
stronger the belief the instance holds in the subject. The $bel(i, s)$
values are used to select the right instances to populate the LCSH
ontology.

Figure 2. An Instance from a Library Catalog Item

¹ http://classificationweb.net/.

² In particular, the QUT library. For the sake of simplicity, only
the abstracted information (title, table of contents, and summary) was
used to represent an instance. Examples of instances can be found at
http://www.library.qut.edu.au.

A measure, specificity [18, 19] (denoted as $spe$), was further used to
quantify the focus of a subject in the LCSH ontology. Subjects located
at upper levels of the ontology are more abstract than those at lower
levels towards the “leaves”. Upper-level subjects also have more
descendant subjects in their shadow than lower-level subjects. Thus, an
upper-level subject has a weaker focus than a lower-level subject in
its shadow.

The $spe$ value of a subject $s$ is determined by analyzing its
associated hierarchical relations of is-a and part-of. The $spe$ value
of “leaf” subjects is set to 1, and the $spe$ value decreases for each
level up toward the root of the ontology. If all direct child subjects
in the shadow of a subject have an is-a relationship to it, the
subject’s $spe$ value is the smallest $spe$ of its child subjects,
decreased by 10%. If all direct child subjects have a part-of
relationship, its $spe$ is defined as the average $spe$ value of its
child subjects, with the same 10% decrease applied. If the direct child
subjects are a mix of is-a and part-of relations to their parent
subject, two $spe$ values are calculated: one for the is-a child
subjects, and one for the part-of child subjects. The smaller of the
two is then chosen as the $spe$ of the parent subject. As a result, the
specificity of an upper-level subject is guaranteed to be smaller than
that of a lower-level subject in its shadow.
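The bottom-up specificity rule described above can be illustrated with a short Python sketch. This is only a reading of the prose, not the authors' implementation: the function name `specificity`, the two child maps `IS_A` and `PART_OF`, and the tiny example hierarchy are all hypothetical, and the hierarchy is assumed to be acyclic.

```python
from statistics import mean

DECAY = 0.9  # one level up keeps 90% of the value (the 10% decrease in the text)


def specificity(subject, is_a, part_of, cache=None):
    """Return spe(subject) for a hierarchy given as two child maps.

    is_a and part_of map a subject to the list of its direct children
    under that relation. A subject missing from both maps is a leaf.
    """
    if cache is None:
        cache = {}
    if subject in cache:
        return cache[subject]

    is_a_children = is_a.get(subject, [])
    part_of_children = part_of.get(subject, [])

    if not is_a_children and not part_of_children:
        value = 1.0  # leaf subjects are the most specific
    else:
        candidates = []
        if is_a_children:
            # is-a children: smallest child value, decreased by 10%
            smallest = min(specificity(c, is_a, part_of, cache) for c in is_a_children)
            candidates.append(DECAY * smallest)
        if part_of_children:
            # part-of children: average child value, decreased by 10%
            average = mean(specificity(c, is_a, part_of, cache) for c in part_of_children)
            candidates.append(DECAY * average)
        # mixed case: the smaller of the two candidate values is kept
        value = min(candidates)

    cache[subject] = value
    return value


# Hypothetical toy hierarchy: root --is-a--> A, B ; B --part-of--> C, D
IS_A = {"root": ["A", "B"]}
PART_OF = {"B": ["C", "D"]}
```

In this toy hierarchy, the leaves `A`, `C`, and `D` get 1, `B` gets 0.9 (the average of its part-of children, decreased by 10%), and `root` gets 0.9 times the smaller of its is-a children.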

2.2  Interesting  Subject  Discovery 

Given a topic $T := \{t_1, t_2, \ldots, t_n\}$, two sets of subjects
were extracted from the LCSH ontology: positive subjects $S^+$, which
are relevant to the topic, and negative subjects $S^-$, which are
paradoxical or ambiguous with respect to the topic. If a subject’s
label contains any keyword in the topic ($label(s) \cap T \neq
\emptyset$), the subject is extracted and put into the initial positive
subject set ($S^+ = S^+ \cup \{s\}$). The positive level of $s$ to $T$
is then measured by

$$pos(s, T) = spe(s) \times |label(s) \cap T| \times sup(i, T),$$

where

$$sup(i, T) = \sum_{s' \in \eta(i)} bel(i, s') \times |label(s') \cap T|;$$

as defined previously, $\eta(i)$ refers to the set of subjects cited
by $i$, and $\eta^{-1}(s)$ gives the set of instances citing $s$.
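As a rough illustration of the belief and support measures defined above, the following Python sketch computes $bel(i, s)$ and $sup(i, T)$ for a single instance. Several things here are assumptions made for the example: the helper `keywords` (plain lowercase word matching, no stemming), the representation of an instance as an ordered citing list, and every subject in `CITING_LIST` except the first one, which is the only subject the text names for the Fig. 2 instance.

```python
import re


def keywords(text):
    """Lowercased word set of a label or topic (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def bel(citing_list, subject):
    """bel(i, s) = 1 / (index(s, i) * |eta(i)|), with a one-based index."""
    index = citing_list.index(subject) + 1
    return 1.0 / (index * len(citing_list))


def sup(citing_list, topic):
    """sup(i, T): belief-weighted keyword overlap over all subjects cited by i."""
    topic_terms = keywords(topic)
    return sum(
        bel(citing_list, subject) * len(keywords(subject) & topic_terms)
        for subject in citing_list
    )


# Only the first subject comes from the text; the other three are
# hypothetical placeholders so that the list has four entries.
CITING_LIST = [
    "Consumption (Economics)-Germany (East)",
    "Germany (East)-Economic conditions",
    "Consumers-Germany (East)",
    "Socialism-Germany (East)",
]
```

With this list, `bel(CITING_LIST, CITING_LIST[0])` reproduces the value 0.25 from the worked example in Section 2.1.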

The reachable ancestor and descendant subjects of $s$ in the ontology
were also extracted. Here, “reachable” is limited to a distance of
three edges in the ontology, because subjects located farther away are
unlikely to be important to $T$, as reported by [6]. These reachable
subjects were extracted and put into the negative subject set ($S^-$).

User background knowledge was discovered from the references between
the subjects and their instances. Let $s_1 \in S^+$ and $s_2 \in S^-$.
If $\eta^{-1}(s_1) \cap \eta^{-1}(s_2) \neq \emptyset$, $s_1$ and $s_2$
have something in common and are related. The certainty level of $s_2$
being positive was thus determined by its linked positive subjects
(e.g., $s_1 \in S^+$). A subject is more interesting if it has more
linked positive subjects. Let $S(s)$ be the set of linked positive
subjects of $s \in S^-$; we measure the certainty level of $s$ to $T$
by

$$pos(s, T \mid s \in S^-) = \frac{\sum_{s' \in S(s)} conf(s', s) \times pos(s', T)}{|S(s)|},$$

where

$$conf(s', s) = \frac{|\eta^{-1}(s') \cap \eta^{-1}(s)|}{|\eta^{-1}(s')|}.$$

Considering such discovered user background knowledge, if a subject
$s \in S^-$ has $pos(s, T) > 0$, it is removed from $S^-$ and moved to
$S^+$.
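The promotion of negative subjects can be sketched in Python as follows. This is a hedged illustration only: it assumes that $conf(s', s)$ is the share of the instances citing $s'$ that also cite $s$ (the denominator is hard to read in the source), and the names `conf`, `certainty`, `INSTANCES_OF`, and `POS_LEVEL`, as well as the toy data, are hypothetical.

```python
def conf(instances_of, s_pos, s_neg):
    """Share of the instances citing s_pos that also cite s_neg (assumed form)."""
    citing_pos = instances_of[s_pos]
    if not citing_pos:
        return 0.0
    return len(citing_pos & instances_of[s_neg]) / len(citing_pos)


def certainty(instances_of, pos_level, s_neg, positive_subjects):
    """Average conf-weighted pos value over the positive subjects linked to s_neg."""
    linked = [
        s for s in positive_subjects
        if instances_of[s] & instances_of[s_neg]  # shares at least one instance
    ]
    if not linked:
        return 0.0
    total = sum(conf(instances_of, s, s_neg) * pos_level[s] for s in linked)
    return total / len(linked)


# Hypothetical data: subject -> ids of the instances citing it.
INSTANCES_OF = {"A": {1, 2, 3, 4}, "B": {5}, "X": {2, 3, 9}}
POS_LEVEL = {"A": 0.8, "B": 0.5}  # pos(s, T) of the positive subjects
```

Here the negative subject `X` shares two instances with `A` and none with `B`, so only `A` is linked to it. Its certainty is above zero, which under the rule in the text would move it from the negative set to the positive set.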


2.3  Query  Expansion  for  Pseudo  Relevance
Feedback  Extraction


The query terms were expanded based on the positive subjects discovered
in the previous section. In Section 2.2, a set of positive subjects
$S^+$ was discovered, in which each subject was assigned a $pos$ value
indicating the certainty level of the subject being relevant to the
given topic. From Section 2.1, we know that a subject refers to a set
of instances. Thus, a training set $D^+$ could be generated, in which
each document $d$ came from the content of an instance $i$ referred to
by a positive subject $s \in S^+$. A support value was calculated for
each document in the training set by accumulating the $pos$ values of
all subjects on the citing list of the instance. The expanding terms
were extracted from this training set.

The training set was first used to evaluate weights for a set of
selected terms $T$. After text pre-processing (stopword removal and
word stemming), the semantic space referred to by a document $d$ was
represented by its normal form
$\beta(d) = \{(t_1, w_1), (t_2, w_2), \ldots, (t_k, w_k)\}$, where $w$
is the weight distribution of terms,
$w_i = \frac{f_i}{\sum_{j=1}^{k} f_j}$, and $f_i$ is the term frequency
of $t_i$ in $d$. A probability function on $T$ was derived from the
normal forms of the positive documents and their supports, for all
$t \in T$:

$$pr_\beta(t) = \sum_{d \in D^+,\ (t, w) \in \beta(d)} support(d) \times w$$
Figure  3.  The  Stage  2  Architecture


The terms with the top 150 $pr_\beta(t)$ values were then selected to
expand the query terms given in $T$. Details of the evaluation can be
found in [10].

The documents in the ClueWeb09 corpus were indexed by accumulating the
$pr_\beta(t)$ values of the expanded top 150 terms that occurred in the
document titles. Because ClueWeb09 Category-B is a large corpus, only
document titles were counted in this index calculation in order to
reduce the complexity. The top five indexed documents were chosen as
the pseudo relevance feedback from users, and submitted as the results
for Phase 1 of the track.
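The term weighting and title-based indexing of this section can be sketched in Python as below. It is a minimal sketch under stated assumptions: documents are already tokenized, stopped, and stemmed; each distinct title term is counted once when a title is scored; and the function names (`normal_form`, `term_probabilities`, `top_terms`, `title_score`) and sample data are hypothetical rather than taken from the authors' system.

```python
from collections import Counter, defaultdict


def normal_form(tokens):
    """Normal form of a document: term -> f_i / sum_j f_j."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: freq / total for term, freq in counts.items()}


def term_probabilities(positive_docs):
    """Accumulate support(d) * w over the positive training documents.

    positive_docs is a list of (tokens, support) pairs.
    """
    scores = defaultdict(float)
    for tokens, support in positive_docs:
        for term, weight in normal_form(tokens).items():
            scores[term] += support * weight
    return dict(scores)


def top_terms(scores, k=150):
    """Keep the k highest-scoring terms (150 in the text)."""
    best = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(best)


def title_score(title_tokens, expanded):
    """Index value of a document: sum of the scores of expanded terms in its title."""
    return sum(expanded[term] for term in set(title_tokens) if term in expanded)


# Hypothetical pre-processed training documents with their support values.
DOCS = [(["web", "search", "web"], 2.0), (["search", "engine"], 1.0)]
SCORES = term_probabilities(DOCS)
EXPANDED = top_terms(SCORES, k=2)
```

Ranking all document titles by `title_score` and keeping the five best would then correspond to the Phase 1 selection described above.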

3  Relevance  Feature  Discovery 

Relevance feature discovery aims to discover a set of features from
text documents that describe what a user wants. In Phase 2 of the
TREC’09 Relevance Feedback track, a given topic was represented by a
set of user judgements containing documents associated with values of
0, 1, or 2, indicating that a document is non-relevant, relevant, or
highly relevant to the topic, respectively. Treating the documents
judged 1 or 2 as equally positive and those judged 0 as negative, we
obtained two sets: positive and negative feedback. In this Stage 2
method, relevance features were discovered from both the positive and
the negative relevance feedback.

When generating the positive and negative feedback, two special cases
were encountered: (i) positive feedback was unavailable because all
judgements were 0 (non-relevant). In this case, we formed a positive
document from the query terms expanded in Stage 1 (as discussed in
Section 2.3), and weighted these terms equally as 1; (ii) negative
feedback was unavailable because all judgements were 1 or 2. In this
case, we used only positive feedback for feature discovery.
Table 1. A set of paragraphs

Paragraph    Terms
dp1          t1 t2
dp2          t3 t4 t6
dp3          t3 t4 t5 t6
dp4          t3 t4 t5 t6
dp5          t1 t2 t6 t7
dp6          t1 t2 t6 t7

The pattern-based features were first extracted from the positive user
feedback. These features were then used to iteratively select and
re-select meaningful negative documents (called offenders in this
paper) from the negative feedback. The offenders were used to revise
the extracted features. Finally, the revised features were used to
retrieve the final results from ClueWeb09 Category-B. Figure 3
illustrates the architecture of our model in Stage 2.

3.1  Frequent  and  Closed  Sequential  Patterns

For a given topic, relevance feature discovery extracts a set of
features, including patterns and terms, from a document set and assigns
them weights. The document set, usually called a training set and
denoted as $D$, consists of a set of positive documents ($D^+$) and a
set of negative documents ($D^-$). When a document $d$ is split into
paragraphs, it can also be represented by a set of paragraphs $PS(d)$.

Let $T = \{t_1, t_2, \ldots, t_m\}$ be a set of terms extracted from
$D^+$, and let $X$ be a set of terms (called a termset) in document
$d$. $coverset(X)$ denotes the covering set of $X$ for $d$, which
includes all paragraphs $dp \in PS(d)$ where $X \subseteq dp$, i.e.,
$coverset(X) = \{dp \mid dp \in PS(d), X \subseteq dp\}$. The absolute
support of $X$ is the number of occurrences of $X$ in $PS(d)$:
$sup_a(X) = |coverset(X)|$. The relative support of $X$ is the fraction
of the paragraphs that contain the pattern:
$sup_r(X) = \frac{|coverset(X)|}{|PS(d)|}$. A termset $X$ is then
called a frequent pattern if its $sup_a$ (or $sup_r$) $\geq min\_sup$,
a minimum support.

Table 1 lists a set of paragraphs for a document $d$, where
$PS(d) = \{dp_1, dp_2, \ldots, dp_6\}$ with duplicate terms removed.
Assuming $min\_sup = 3$, ten frequent patterns would be extracted, as
shown in Table 2.
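The frequent-pattern example can be reproduced with a small brute-force Python sketch. The paragraph contents below are the reading of Table 1 that is consistent with the covering sets in Table 2; the function names and the level-wise enumeration are illustrative assumptions and not the SPMining algorithm referred to later in the paper.

```python
from itertools import combinations

# Paragraphs of document d, as read from Table 1.
PARAGRAPHS = {
    "dp1": {"t1", "t2"},
    "dp2": {"t3", "t4", "t6"},
    "dp3": {"t3", "t4", "t5", "t6"},
    "dp4": {"t3", "t4", "t5", "t6"},
    "dp5": {"t1", "t2", "t6", "t7"},
    "dp6": {"t1", "t2", "t6", "t7"},
}


def coverset(termset, paragraphs):
    """Names of the paragraphs that contain every term of the termset."""
    return {name for name, terms in paragraphs.items() if termset <= terms}


def frequent_patterns(paragraphs, min_sup):
    """Map each termset with absolute support >= min_sup to its covering set."""
    vocabulary = sorted(set().union(*paragraphs.values()))
    found = {}
    for size in range(1, len(vocabulary) + 1):
        found_at_this_size = False
        for combo in combinations(vocabulary, size):
            cover = coverset(set(combo), paragraphs)
            if len(cover) >= min_sup:
                found[frozenset(combo)] = cover
                found_at_this_size = True
        if not found_at_this_size:
            # Apriori property: with no frequent termset of this size,
            # no larger termset can be frequent either.
            break
    return found


FREQUENT = frequent_patterns(PARAGRAPHS, min_sup=3)
```

With `min_sup=3`, `FREQUENT` holds the ten termsets listed in Table 2, each mapped to its covering set.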

Given a set of paragraphs $Y \subseteq PS(d)$, we can also define its
termset, which satisfies

$$termset(Y) = \{t \mid \forall dp \in Y \Rightarrow t \in dp\}.$$

By defining the closure of $X$ as

$$Cls(X) = termset(coverset(X)),$$


Table 2. Frequent patterns and covering sets

Frequent Pattern    Covering Set
{t3, t4, t6}        {dp2, dp3, dp4}
{t3, t4}            {dp2, dp3, dp4}
{t3, t6}            {dp2, dp3, dp4}
{t4, t6}            {dp2, dp3, dp4}
{t3}                {dp2, dp3, dp4}
{t4}                {dp2, dp3, dp4}
{t1, t2}            {dp1, dp5, dp6}
{t1}                {dp1, dp5, dp6}
{t2}                {dp1, dp5, dp6}
{t6}                {dp2, dp3, dp4, dp5, dp6}

a pattern (or termset) $X$ is closed if and only if $X = Cls(X)$.

Let $X$ be a closed pattern. We have

$$sup_a(X_1) < sup_a(X) \qquad (1)$$

for all patterns $X_1 \supset X$.

A taxonomy can be constructed by using closed patterns with the is-a
(or subset) relation. Table 2 contains three closed patterns,
$\langle t_3, t_4, t_6 \rangle$, $\langle t_1, t_2 \rangle$, and
$\langle t_6 \rangle$, among its ten frequent patterns. After pruning
the non-closed patterns, a pattern taxonomy $PT$ can be constructed,
such as $PT = \{\langle t_3, t_4, t_6 \rangle, \langle t_1, t_2
\rangle, \langle t_6 \rangle\}$ for Table 2, where $\langle t_6
\rangle$ is considered a subset of $\langle t_3, t_4, t_6 \rangle$.
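The closure test can be checked on the same example with a few lines of Python. The paragraph data is again the assumed reading of Table 1, and the helper names are hypothetical. Returning the termset itself when no paragraph covers it is a convention chosen for this sketch, since the text does not discuss that case.

```python
# Paragraphs of document d, as read from Table 1.
PARAGRAPHS = {
    "dp1": {"t1", "t2"},
    "dp2": {"t3", "t4", "t6"},
    "dp3": {"t3", "t4", "t5", "t6"},
    "dp4": {"t3", "t4", "t5", "t6"},
    "dp5": {"t1", "t2", "t6", "t7"},
    "dp6": {"t1", "t2", "t6", "t7"},
}


def coverset(termset):
    """Names of the paragraphs containing every term of the termset."""
    return {name for name, terms in PARAGRAPHS.items() if set(termset) <= terms}


def termset_of(paragraph_names):
    """Terms shared by all of the named paragraphs (non-empty input expected)."""
    return set.intersection(*(PARAGRAPHS[name] for name in paragraph_names))


def closure(termset):
    """Cls(X) = termset(coverset(X))."""
    cover = coverset(termset)
    if not cover:
        return set(termset)  # convention of this sketch for uncovered termsets
    return termset_of(cover)


def is_closed(termset):
    """A pattern is closed if and only if it equals its own closure."""
    return closure(termset) == set(termset)
```

For instance, `closure({"t3"})` grows to `{"t3", "t4", "t6"}`, so `{t3}` is not closed, whereas `{t6}` is its own closure because it also occurs in `dp5` and `dp6`.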

Small patterns (e.g., $\langle t_6 \rangle$) in a taxonomy are usually
general because they are more likely to be used frequently. Conversely,
large patterns (e.g., $\langle t_3, t_4, t_6 \rangle$) are relatively
specific because they usually have a low frequency.

A sequential pattern $s = \langle t_1, \ldots, t_r \rangle$
($t_i \in T$) is an ordered list of terms. A sequence
$s_1 = \langle x_1, \ldots, x_i \rangle$ is a sub-sequence of
$s_2 = \langle y_1, \ldots, y_j \rangle$, denoted by
$s_1 \sqsubseteq s_2$, iff $\exists j_1, \ldots, j_i$ such that
$1 \leq j_1 < j_2 < \cdots < j_i \leq j$ and
$x_1 = y_{j_1}, x_2 = y_{j_2}, \ldots, x_i = y_{j_i}$. Given
$s_1 \sqsubseteq s_2$, we call $s_1$ a sub-pattern of $s_2$, and $s_2$
a super-pattern of $s_1$. To simplify the explanation, we refer to
sequential patterns as patterns.

As for normal patterns, we define the absolute support and relative
support for a pattern (an ordered termset) $X$ in $d$. We also denote
the covering set of $X$ as $coverset(X)$, which includes all paragraphs
$ps \in PS(d)$ such that $X \sqsubseteq ps$, i.e.,
$coverset(X) = \{ps \mid ps \in PS(d), X \sqsubseteq ps\}$. $X$ is then
called a frequent pattern if $sup_r(X) \geq min\_sup$. By using Eq.
(1), a frequent sequential pattern $X$ is closed if there is no
super-pattern $X_1$ of $X$ such that $sup_a(X_1) = sup_a(X)$.
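The sub-sequence relation and the closedness test for sequential patterns can be sketched in Python as follows. The ordered paragraphs reuse the terms of Table 1, but the term order inside each paragraph is an assumption made for this example, and the function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the pattern's terms occur in the sequence in the same order."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the match,
    # so later terms must be found after earlier ones.
    return all(term in remaining for term in pattern)


def absolute_support(pattern, paragraphs):
    """Number of paragraphs that contain the pattern as a sub-sequence."""
    return sum(1 for paragraph in paragraphs if is_subsequence(pattern, paragraph))


def relative_support(pattern, paragraphs):
    """Fraction of paragraphs that contain the pattern as a sub-sequence."""
    return absolute_support(pattern, paragraphs) / len(paragraphs)


def is_closed_sequential(pattern, candidates, paragraphs):
    """Closed if no longer candidate super-pattern has the same absolute support."""
    own_support = absolute_support(pattern, paragraphs)
    for other in candidates:
        if (len(other) > len(pattern)
                and is_subsequence(pattern, other)
                and absolute_support(other, paragraphs) == own_support):
            return False
    return True


# Ordered paragraphs (term order is assumed for illustration).
ORDERED_PARAGRAPHS = [
    ["t1", "t2"],
    ["t3", "t4", "t6"],
    ["t3", "t4", "t5", "t6"],
    ["t3", "t4", "t5", "t6"],
    ["t1", "t2", "t6", "t7"],
    ["t1", "t2", "t6", "t7"],
]
```

Unlike the unordered termsets above, order matters here: `("t3", "t6")` is supported by three paragraphs, while `("t6", "t3")` is supported by none.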
3.2  Deploying  High-Level  Patterns  on
Low-Level  Terms 


To overcome the problem of low-frequency patterns, a method was
developed to deploy high-level patterns over low-level terms. The
evaluation of term supports (weights) in this paper differs from that
in term-based approaches. In a term-based approach, the value of a term
is scaled based on its appearance in documents. In our method, the
value of a term is scaled based on its appearance in discovered
patterns.

To improve the efficiency of pattern taxonomy mining (PTM), an
algorithm, SPMining($D^+$, $min\_sup$), was introduced in [21] and
further developed in [11, 20] to find closed sequential patterns in the
positive documents $D^+$. The SPMining algorithm uses the well-known
Apriori property to narrow down the search space.

Let $SP_1, SP_2, \ldots, SP_n$ be the sets of discovered closed
sequential patterns for all documents $d_i \in D^+$
($i = 1, \ldots, n$), where $n = |D^+|$. For a given term $t$, its
weight in discovered patterns is assigned by

$$w(t, D^+) = \sum_{i=1}^{n} \sum_{p \in SP_i,\ t \in p} \frac{sup_r(p, d_i)}{|p|} \qquad (2)$$

where $|p|$ is the number of terms in $p$.

With weights assigned to the terms in $D^+$, a function can be used to
rank and judge the relevance of incoming documents:

$$rank(d) = \sum_{t \in T} w(t)\, r(t, d)$$

where $w(t) = w(t, D^+)$, and $r(t, d) = 1$ if $t \in d$; otherwise
$r(t, d) = 0$.
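The deployment of pattern supports onto terms (Eq. 2) and the ranking function can be illustrated with the following Python sketch. The input format (one dictionary of closed patterns and their relative supports per positive document), the function names, and the sample numbers are assumptions made for illustration.

```python
from collections import defaultdict


def deploy_weights(pattern_sets):
    """Spread each pattern's relative support evenly over its terms (Eq. 2).

    pattern_sets holds one dict per positive document, mapping a pattern
    (a tuple of terms) to its relative support in that document.
    """
    weights = defaultdict(float)
    for patterns in pattern_sets:
        for pattern, sup_r in patterns.items():
            for term in set(pattern):
                weights[term] += sup_r / len(pattern)
    return dict(weights)


def rank(document_terms, weights):
    """rank(d): sum of w(t) over the weighted terms that occur in the document."""
    present = set(document_terms)
    return sum(weight for term, weight in weights.items() if term in present)


# Hypothetical closed patterns for two positive documents.
PATTERN_SETS = [
    {("t1", "t2"): 0.5, ("t6",): 1.0},
    {("t1",): 0.4},
]
WEIGHTS = deploy_weights(PATTERN_SETS)
```

In this toy input, `t1` receives 0.5/2 from the first document and 0.4 from the second, so a long pattern contributes less per term than a short one with the same support.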


3.3  Mining  Negative  Patterns  for  Revising
Low-Level  Features


Generally speaking, the definition of relevance is subjective. People
may describe the relevance of a topic (or a document) in two
dimensions, specificity and exhaustivity, where specificity describes
the extent to which the topic focuses on what users want, and
exhaustivity describes the extent to which the topic deals with what
users want. Such a two-dimensional description is easy for human beings
to use but difficult for a computational system to apply. In this
section, we first discuss how to use the two dimensions to understand
the semantic meaning of low-level feature terms. We then present an
algorithm for negative pattern discovery and term weight revision.


3.3.1  Specific  and  General  Features 

Let $DP^+$ be the union of all patterns in the pattern taxonomies
discovered from $D^+$, and $DP^-$ be the union of all negative patterns
in the pattern taxonomies discovered from $D^-$. A closed sequential
pattern of $D^+$ (or $D^-$) is called a positive pattern (or negative
pattern).

Given a term $t \in T$, its exhaustivity refers to the number of
discovered patterns containing $t$ in both $DP^+$ and $DP^-$, and its
specificity refers to the number of discovered patterns containing $t$
in $DP^+$ only, but not in $DP^-$. Based on these, we can classify
terms into three groups: general terms ($GT$) for those appearing in
both positive patterns and negative patterns; positive specific terms
($T^+$) for those appearing in only positive patterns; and negative
specific terms ($T^-$) for those appearing in only negative patterns.
They are defined by:

$$GT = \{t \mid (\exists p_1 \in DP^+) \wedge (\exists p_2 \in DP^-) \Rightarrow t \in (p_1 \cap p_2)\},$$

$$T^+ = \{t \mid t \notin GT, \exists (p \in DP^+) \Rightarrow t \in p\}, \text{ and}$$

$$T^- = \{t \mid t \notin GT, \exists (p \in DP^-) \Rightarrow t \in p\},$$

where $GT \cap T^+ \cap T^- = \emptyset$.
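The three term groups follow directly from set operations, as the short Python sketch below shows. Patterns are represented as tuples of terms; the function name `classify_terms` and the sample patterns are hypothetical.

```python
def classify_terms(positive_patterns, negative_patterns):
    """Split pattern terms into general, positive specific, and negative specific."""
    positive_terms = set().union(*positive_patterns)
    negative_terms = set().union(*negative_patterns)
    general = positive_terms & negative_terms            # GT: in both kinds of patterns
    positive_specific = positive_terms - negative_terms  # T+: only in positive patterns
    negative_specific = negative_terms - positive_terms  # T-: only in negative patterns
    return general, positive_specific, negative_specific


# Hypothetical positive and negative patterns.
DP_POS = [("t1", "t2"), ("t3", "t4", "t6")]
DP_NEG = [("t6", "t8"), ("t2",)]
GT, T_POS, T_NEG = classify_terms(DP_POS, DP_NEG)
```

Because the groups are built from an intersection and two set differences, they are pairwise disjoint by construction.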

Specific terms carry more semantic meaning and distinguish a topic from
others. Thus, specific terms are useful for describing the relevance
features of a topic. However, using specific terms alone may be
insufficient when trying to improve the performance of relevance
feature discovery, because documents containing no specific terms may
also reflect user information needs. Therefore, one possible solution
is to use a hybrid of specific terms, general terms, and negative
terms. However, adequate control is necessary for the side effects
generated by using general terms.

3.3.2  Revision  Strategy 

In this section, we discuss the basic strategy for revising the
features discovered from a training set. The revision process takes
place only after terms have been classified into the three categories
of general, positive specific, and negative specific terms.

From the positive documents in a training set, the revision process
first discovers initial positive features, including high-level
positive patterns and low-level terms. After selecting some negative
samples from the negative documents in the training set, the process
discovers negative patterns and terms using the same pattern mining
technique as that used for positive feature discovery. The process then
revises the initial features to obtain revised features. This process
can be repeated several times: selecting negative documents, mining
negative features, and revising the previously revised features.

Algorithm NFMining($D$) describes the details of the revision strategy,
with the assumption that the number of negative documents is greater
than the number of positive documents. For a given training set
$D = \{D^+, D^-\}$, we assume that the initial features,
$(DP^+, DP^-, T)$, have been


Algorithm 1. NFMining($D$)

Input: A training set, $\{D^+, D^-\}$; $a = -1$;
extracted features $(DP^+, DP^-, T)$, $DP^- = \emptyset$;
support  function,  minimum  support  min_sup, 
and  experimental  parameters  K  and  a. 

Output:  Updated  term  set  T  and  function  weight. 

Method: 

1: $GT = \emptyset$, $T^+ = \emptyset$, $T^- = \emptyset$, $loop = 0$;
2: foreach $t \in T$ do
3:     $weight(t) = weight(t, D^+)$;
4: foreach $d \in D^-$ do
5:     $rank(d) = \sum_{t \in d \cap (T \cup T^-)} weight(t)$;
6: let $D^- = \{d_0, d_1, \ldots, d_{|D^-|-1}\}$ in descending order,
let  j  =  i t  if  loop  =  0,  otherwise  j  =  0; 

7:  Df  =  {di|di  G  D~,j  <i<I\  +  )}; 

8: $DP^-$ = SPMining(Df, $min\_sup$); // find negative patterns
9: $T_0 = \{t \in p \mid p \in DP^-\}$; // all terms in negative patterns
10: foreach $t \in (T_0 - T)$ do
11:     if ($loop = 0$) then $weight(t) = a \times weight(t, \text{Df})$
        else $weight(t) = a \times weight(t, \text{Df}) + weight(t)$;
12: $T^- = T^- \cup (T_0 - T)$, $loop$++;
13: if $loop < 3$ then goto step 4;
14: foreach $t \in T$ do // term partition
15:     if ($t \in T^-$) then $GT = GT \cup \{t\}$
        else $T^+ = T^+ \cup \{t\}$;
16: foreach $t \in T^+$ do

17:  weight(t)  =  weight  (t. )  x  (1  +  f g d } | 

18: $T = T \cup T^-$;


Table  3.  Example  of  a  set  of  terms  discovered 
from  DP+,  DP+  g  D+  and  \D+\  =  6. 


term    weight    # of docs that include the term
t1      0.34      4
t2      0.90      6
t3      0.55      3
t4      0.65      5
t5      0.75      6
t6      0.84      2

extracted from positive documents $D^+$ before the algorithm starts,
where $T = \{t \in p \mid p \in DP^+\}$ and $DP^- = \emptyset$. The
experimental parameter is set as $a = -1$ to calculate the weights of
terms in negative patterns.

Step 1 initializes the sets of general terms $GT$, positive specific
terms $T^+$, and negative specific terms $T^-$. The variable $loop$ is
used to control the number of revision cycles. Steps 2 and 3 compute
weights for all terms in $T$. Table 3 shows a set of terms and their
weights deployed from positive patterns. In the experiments, when
positive documents were unavailable, a set of 100 terms from query
expansion (as discussed in Section 2.3), each with its weight set to 1,
was used as positive terms.

Steps 4 and 5 rank the documents in the negative document set. If $t$
is a negative specific term, it has a revising weight evaluated in
Steps 10 and 11. The weight function is described as:

$$weight(t) = \begin{cases} \text{its revising weight}, & \text{if } t \in T^- \\ support(t, D^+), & \text{otherwise} \end{cases}$$

Steps 6 and 7 sort the negative documents based on their rank values,
and select offenders (meaningful negative documents). A document is
considered negative to the topic if its rank is lower than or equal to
0. In the first loop, the minimum rank that a document can get is 0
because the term set $T$ contains no negative weights. From the next
loop onward, however, some negative terms from $D^-$ with negative
weights are added, so a document may well get a rank below 0. If a
negative document has a high rank, it is selected as an offender
because it forces the system to make a mistake. The offenders are
normally defined as the top-$K$ negative documents in the sorted $D^-$
[10]. Given that positive documents are the main source of features, we
expect the total number of offenders to be no more than the number of
positive documents.
Therefore, in our experiments $K$ is set relative to $|D^+|$
($K = 2$ for the sample with $|D^+| = 6$ discussed in the following
example). In the first revision (loop = 0), where $T$ contains only
positive terms and no negative terms have been added yet, the top-$a$
negative documents are omitted from offender selection. The
initial features come from positive documents only, and the
positive features are more important than negative features
at the beginning. Here $a$ is an experimental parameter derived
from $|D^+|$ by a floor operation ($a = 1$ in the same example).

To illustrate, Tables 3 and 4 are used as an example of
the offender selection process. Table 4 shows a list of
negative documents ranked using the terms in Table 3.
The first step is to eliminate the documents with
weight less than or equal to 0; thus $d_6$ and $d_7$ in Table 4 are
not considered as offenders. For the sample shown in Tables 3 and 4,
there are 13 training documents, with
$|D^+| = 6$ and $|D^-| = 7$. Therefore, $K = 2$;
if loop = 0 then $j = a = 1$, otherwise $j = 0$.
After that, starting from position $j + 1$ and counting $K$ documents,
the documents in this range are selected as offenders. As
a result, $d_2$ and $d_3$ from Table 4 are selected as offenders in the
first loop (loop = 0). In the second and third loops the same
process is repeated with $j = 0$ and the updated list of terms.
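The selection procedure in this example can be sketched as below. The weights are those of Table 4, and $K = 2$ and $j = 1$ are the values given in the text; the function name and signature are hypothetical, and the sketch is illustrative rather than the authors' code.

```python
from typing import List, Tuple


def select_offenders(ranked: List[Tuple[str, float]], k: int, skip: int) -> List[str]:
    """Pick offenders from negative documents sorted by descending weight.

    Documents with weight <= 0 are dropped first; then `skip` top
    documents are passed over and the next `k` are returned.
    """
    candidates = [(doc_id, w) for doc_id, w in ranked if w > 0]
    return [doc_id for doc_id, _ in candidates[skip:skip + k]]


# Ranked negative documents and weights as listed in Table 4.
table4 = [("d1", 0.67), ("d2", 0.60), ("d3", 0.44), ("d4", 0.34),
          ("d5", 0.30), ("d6", 0.00), ("d7", 0.00)]

# First loop (loop = 0): j = 1, so the top-ranked document is skipped.
first_loop = select_offenders(table4, k=2, skip=1)
print(first_loop)
```

For the first loop this yields the documents ranked second and third. In later loops the same call would use `skip=0` on a ranking recomputed with the updated term weights.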

Steps 8 and 9 extract negative features $(DP^-, T_0)$
from the selected negative documents (the offenders). The SPMining
algorithm, with minimum support min_sup, is applied to the offenders
to discover the negative patterns $DP^-$ and the term set $T_0$,
which includes all terms in the patterns of $DP^-$. Table 5 shows
a list of terms extracted from the offenders.

Table 4. A set of ranked negative documents with their weights, $|D^-| = 7$.

Rank  Negative document  Weight
1     d1                 0.67
2     d2                 0.60
3     d3                 0.44
4     d4                 0.34
5     d5                 0.30
6     d6                 0.00
7     d7                 0.00

Table 5. A set of terms discovered from offender documents.

Term  Weight
t1    -0.20
t3    -0.45
t7    -0.50
t8    -0.75

Steps 10 to 12 revise the weights of negative specific
terms. These steps are executed three times in a loop, with the
iteration controlled by Step 13. In each loop, if a negative
specific term is extracted for the first time, the algorithm
negates its support obtained from the selected negative
documents; otherwise, the algorithm accumulates its weight as
follows:

$$weight(t) = a \times weight(t, \text{offenders}) + weight(t)$$
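One revision pass over the negative specific terms can be sketched as follows. This is a hedged illustration: it assumes that terms already present in the positive term set are handled separately (as general terms) and are skipped here, and the function name, arguments, and toy supports are hypothetical.

```python
from typing import Dict, Set


def revise_negative_weights(neg_weights: Dict[str, float],
                            offender_support: Dict[str, float],
                            positive_terms: Set[str],
                            a: float = -1.0) -> Dict[str, float]:
    """Apply one revision loop to the weights of negative specific terms.

    A term seen for the first time gets its support negated; a term seen
    before accumulates a * support on top of its current weight.
    """
    revised = dict(neg_weights)
    for term, support in offender_support.items():
        if term in positive_terms:
            continue  # assumption: general terms are not revised here
        if term not in revised:
            revised[term] = -support
        else:
            revised[term] = a * support + revised[term]
    return revised


# Hypothetical supports from the offenders of two consecutive loops.
positive_terms = {"t1", "t2"}
after_loop_0 = revise_negative_weights({}, {"t7": 0.5, "t1": 0.2}, positive_terms)
after_loop_1 = revise_negative_weights(after_loop_0, {"t7": 0.25, "t8": 0.75}, positive_terms)
print(after_loop_1)
```

With $a = -1$ both branches push the weight of a negative specific term further below zero each time it reappears among the offenders.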

After three loops, the algorithm partitions $T$ into general
terms $GT$ and positive specific terms $T^+$ in Steps 14 and
15. It also revises the weights of positive specific terms
in Steps 16 and 17 using the following equation:

$$weight(t) = weight(t) \times \left(1 + \frac{|\{d \mid d \in D^+, t \in d\}|}{|D^+|}\right) \qquad (3)$$

Finally,  T  is  updated  to  include  negative  specific  terms 
at  Step  18. 

Tables 3 and 5 show the sets of terms extracted from the
positive documents and the offenders, respectively. The method
introduced in Section 2.3 is again used to classify these terms into three
main groups: specific positive, specific negative, and general
terms:

$$T^+ = \{(t_2)_{0.90}, (t_4)_{0.65}, (t_5)_{0.75}, (t_6)_{0.84}\}$$

$$T^- = \{(t_7)_{-0.50}, (t_8)_{-0.75}\}$$

$$G = \{(t_1)_{(0.34, -0.20)}, (t_3)_{(0.65, -0.45)}\}$$
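The grouping in this example can be reproduced with a small sketch. The classification method of Section 2.3 is not shown in this section, so the code below assumes plain set membership (terms found only in positive documents, only in offenders, or in both), which matches the example; the function name is hypothetical. The positive weights are read from the groups listed in the text and the negative weights from Table 5.

```python
from typing import Dict, Tuple


def partition_terms(pos_weights: Dict[str, float],
                    neg_weights: Dict[str, float]
                    ) -> Tuple[Dict[str, float], Dict[str, float], Dict[str, Tuple[float, float]]]:
    """Split terms into positive specific, negative specific, and general."""
    pos_terms, neg_terms = set(pos_weights), set(neg_weights)
    t_plus = {t: pos_weights[t] for t in pos_terms - neg_terms}
    t_minus = {t: neg_weights[t] for t in neg_terms - pos_terms}
    # General terms keep both weights: (weight in D+, weight in offenders).
    general = {t: (pos_weights[t], neg_weights[t]) for t in pos_terms & neg_terms}
    return t_plus, t_minus, general


pos_weights = {"t1": 0.34, "t2": 0.90, "t3": 0.65, "t4": 0.65, "t5": 0.75, "t6": 0.84}
neg_weights = {"t1": -0.20, "t3": -0.45, "t7": -0.50, "t8": -0.75}
t_plus, t_minus, general = partition_terms(pos_weights, neg_weights)
print(sorted(t_plus), sorted(t_minus), sorted(general))
```

Under this assumption $t_1$ and $t_3$ fall into the general group because they occur on both sides, while $t_7$ and $t_8$ are negative specific.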

The terms in $T^+$ and $T^-$ have only one weight, whereas
the terms in the general group $G$ have two weights: the
first is for the term's occurrence in $D^+$; the second
is for its occurrence in the offenders. Because the
group $T^+$ is more important than $T^-$ and $G$, the weight
of a term $t \in T^+$ is increased by Eq. (3) based on $t$'s appearance
in positive documents. For the negative terms $T^-$, the
weights are updated via the three-loop technique shown at
Step 11. The groups of terms with updated weights are:

$$T^+ = \{(t_2)_{1.8 = 0.90 \times (1 + 6/6)}, (t_4)_{1.19}, (t_5)_{1.5}, (t_6)_{1.12}\}$$

$$T^- = \{(t_7)_{-0.50}, (t_8)_{-0.75}\}$$

$$G = \{(t_1)_{0.34}, (t_3)_{0.65}\}$$
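The update of Eq. (3) can be checked numerically with the short sketch below. The text gives the document count only for $t_2$ (6 of 6 positive documents); the counts used for $t_4$, $t_5$, and $t_6$ are assumptions chosen because they reproduce the printed weights, and the function name is hypothetical.

```python
def boost_positive_weight(weight: float, docs_with_term: int, num_positive_docs: int) -> float:
    """Eq. (3): scale a positive specific weight by 1 + (document frequency in D+) / |D+|."""
    return weight * (1 + docs_with_term / num_positive_docs)


NUM_POSITIVE = 6  # |D+| in the running example

# (initial weight, number of positive documents containing the term)
# The count for t2 follows the text; the others are assumed.
initial = {"t2": (0.90, 6), "t4": (0.65, 5), "t5": (0.75, 6), "t6": (0.84, 2)}

updated = {term: round(boost_positive_weight(w, n, NUM_POSITIVE), 2)
           for term, (w, n) in initial.items()}
print(updated)
```

Rounded to two decimals, these counts give 1.8, 1.19, 1.5, and 1.12, the updated weights listed for $T^+$.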

NFMining calls SPMining three times. The total number
of negative documents used in these three calls is
$O(|D^+|)$. Therefore, NFMining has the same complexity for mining
negative patterns as SPMining has for mining
positive patterns in $D^+$. NFMining also takes time for sorting
$D^-$, assigning weights to terms, and partitioning terms
into categories. The time complexity of these operations is
$O(|D^-|(\log|D^-| + |T|) + |T|^2)$.

3.4  Final  Retrieval 

Given a topic, the feature terms are extracted using
Algorithm NFMining and each is assigned a value $weight(t)$,
as discussed previously. These features were used in our
experiments to perform the final retrieval. Because the
ClueWeb09 Category-B corpus is huge, the final retrieval
was split into two steps to reduce the complexity.

In the first step, for each topic we retrieved about 30,000
candidate documents from the ClueWeb09 Category-B corpus
based on title search only. The query expansion process
(discussed in Section 2.3) was reused here for candidate
retrieval. Our investigation of the Phase 1 submission
results exposed a limitation: the knowledge
specified in the world ontology was not up to date. The
LCSH system used for ontology construction was the 2006
version. As a result, the ontology lacked some recent
knowledge, e.g., about “Obama” and “Obama family
tree”. To solve this problem, at Stage 2 we used
world knowledge extracted from the Web through the Google
API. For each topic, ten Web documents were retrieved and
pooled with the training set generated from the instances
(library catalog). As discussed in Section 2.3, a set of
expanded query terms was then extracted and used for
candidate retrieval. Finally, approximately 30,000 candidate
documents were retrieved from the Category-B corpus by
accumulating the $prp(t)$ of the terms that occurred in the
document titles.

In the next step, we filtered the candidates by
document content using the features discovered from
positive and negative judgements, as discussed previously.
The 30,000 candidates were re-ranked by accumulating the
$weight(t)$ of the features (see Algorithm NFMining) that
occurred in the document contents. The top 1,000
documents were then selected and submitted as the final
results for the given topic.
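The two-step retrieval described above can be sketched as follows. This is an illustration under stated assumptions, not the system that was run: documents are reduced to sets of title and content terms, scores are plain sums of term weights, and the names `two_step_retrieval`, `title_weights` (standing in for the $prp(t)$ values), and `feature_weights` (standing in for $weight(t)$) are hypothetical, as is the toy corpus.

```python
from typing import Dict, List, Set


def accumulate(terms: Set[str], weights: Dict[str, float]) -> float:
    """Sum the weights of the terms that occur in a title or a document body."""
    return sum(weights.get(t, 0.0) for t in terms)


def two_step_retrieval(corpus: Dict[str, Dict[str, Set[str]]],
                       title_weights: Dict[str, float],
                       feature_weights: Dict[str, float],
                       n_candidates: int = 30000,
                       n_final: int = 1000) -> List[str]:
    """Step 1: select candidates by title score. Step 2: re-rank them by content score."""
    by_title = sorted(corpus,
                      key=lambda d: accumulate(corpus[d]["title"], title_weights),
                      reverse=True)
    candidates = by_title[:n_candidates]
    reranked = sorted(candidates,
                      key=lambda d: accumulate(corpus[d]["content"], feature_weights),
                      reverse=True)
    return reranked[:n_final]


# Hypothetical three-document corpus.
corpus = {
    "a": {"title": {"obama"}, "content": {"obama", "family", "tree"}},
    "b": {"title": {"obama", "family"}, "content": {"spam"}},
    "c": {"title": {"weather"}, "content": {"family", "tree"}},
}
title_weights = {"obama": 1.0, "family": 0.5}
feature_weights = {"family": 0.9, "tree": 0.8, "spam": -0.7}
print(two_step_retrieval(corpus, title_weights, feature_weights, n_candidates=2, n_final=1))
```

In the toy run, document "c" has matching content but never reaches the re-ranking step because its title does not match, which illustrates the trade-off of restricting the first step to titles.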


Topic  eMap    StatAP  Score     Topic  eMap    StatAP  Score
1      13 11   12 12   0.371     26     5 19    9 16    0.6557
2      10 13   10 15   0.6613    27     16 9    16 9    0.3
3      4 11    7 9     0.5676    28     16 9    21 4    0.2459
4      17 1    17 1    0.2982    29     12 3    11 8    0.3235
5      12 4    9 12    0.4286    30     10 15   8 17    0.6508
6      19 5    19 5    0.2545    31     21 4    18 1    0.2419
7      8 15    9 16    0.7258    32     19 6    11 14   0.3621
8      6 11    12 1    0.5       33     12 12   10 14   0.5088
9      13 10   9 16    0.5345    34     16 7    16 8    0.3519
10     15 7    17 6    0.3585    35     9 15    7 17    0.5893
11     9 14    13 11   0.4833    36     11 12   15 8    0.4074
12     21 4    14 11   0.2344    37     15 2    11 7    0.2857
13     11 1    7 6     0.4194    38     10 14   7 18    0.5873
14     12 13   14 11   0.45      39     13 10   11 14   0.4483
15     9 16    12 13   0.5333    40     8 1     12 5    0.2
16     10 10   10 15   0.5536    41     4 17    7 15    0.6226
17     23 1    19 6    0.1455    42     17 3    7 6     0.25
18     12 11   6 18    0.6481    43     9 15    9 16    0.5517
19     6 0     0 0     0         44     9 11    13 11   0.434
20     - -     - -     -         45     18 7    8 17    0.5937
21     11 12   16 7    0.5085    46     8 15    9 15    0.5902
22     9 16    9 16    0.7187    47     8 13    9 13    0.5357
23     5 11    11 7    0.5       48     15 4    14 6    0.25
24     8 8     10 3    0.3421    49     11 9    10 10   0.3958
25     10 15   10 15   0.6562    50     13 8    12 9    0.375
All    9 16    13 12   0.4844

Table 6. Evaluation of Phase 1 performance


4  Results  and  Discussions 

As discussed previously, the Relevance Feedback track
was designed to evaluate a system’s capacity for finding
quality user relevance feedback and utilizing relevance
judgements. In Phase 1, each group submitted five
documents for (pseudo) relevance feedback; in Phase 2, groups
ran their relevance feedback algorithms on different
sets of judged documents from Phase 1, including their own
Phase 1 documents and those of several other groups. The
evaluation then compared the intrinsic quality of the Phase 1
feedback, as well as each group’s relevance feedback
algorithm.

Four measures, eMap [1], MapA, P10A, and StatAP [2],
were used in the track to evaluate the performance of Phase 2
runs. eMap and StatAP were applied to the runs that
used only ClueWeb09 Category-B as the testing set, whereas
MapA and P10A were applied to those using the whole
ClueWeb09 English set. Because our experiments were
based on ClueWeb09 Category-B only, MapA and P10A might
not give an adequate analysis of our performance. Thus,
we examine our results with only eMap and StatAP in this
discussion.

The quality of a set of Phase 1 documents can be judged
by whether groups using that set in Phase 2 performed better
than when using other Phase 1 sets with the same relevance
feedback algorithm. Table 6 shows the detailed results of
the evaluation of our Phase 1 results. In each eMap or StatAP
column, the first number is the number of runs in which using
our Phase 1 set was outperformed by using another group’s
Phase 1 set, whereas the second number is the number of runs
in which using our set outperformed using the others.
Therefore, when the second number is greater than the first,
a larger difference between the two indicates higher quality
of the pseudo relevance feedback we retrieved in Phase 1.
In Table 6, the tied or winning comparisons are those in which
the second number is no smaller than the first. In terms of eMap,
using our Phase 1 feedback was better
than (or equal to) using other groups’ feedback
in 23 out of 49 topics (Topic 20 was dropped because it
had no relevant documents). In terms of StatAP, the number of
tied or winning topics is 24 out of 49. In overall eMap
performance over the 49 topics, our Phase 1 set was better
in 16 runs, considerably more than the 9 runs in which it
was worse. In overall StatAP performance, the two numbers
are quite close (13 vs. 12). Based on these results, the
pseudo relevance feedback retrieved by our group in Phase 1
was of relatively high quality. This is also confirmed by the
performance comparisons illustrated in Fig. 4, where our
submission (QUT.1) is placed in a middle position (ahead of
16 groups but behind 13 groups). Our system’s capacity for
finding quality user relevance feedback is encouraging.
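The way the per-topic pairs are read can be sketched in a few lines. The pairs below are the eMap entries of four topics from Table 6, taken in the order described above (runs in which our set was outperformed, runs in which our set outperformed); the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def tied_or_winning_topics(pairs: Dict[int, Tuple[int, int]]) -> List[int]:
    """Return the topics whose second number is at least as large as the first."""
    return sorted(topic for topic, (worse, better) in pairs.items() if better >= worse)


# eMap pairs for topics 1, 2, 16, and 27 as listed in Table 6.
emap_pairs = {1: (13, 11), 2: (10, 13), 16: (10, 10), 27: (16, 9)}
print(tied_or_winning_topics(emap_pairs))
```

Of these four topics, Topic 2 is a win and Topic 16 a tie, so both count towards the tied-or-winning total.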

Figure 4. Phase 1 Performance Comparison (measure: stAP; plot title: Phase 1 Set: Fraction Each Set is Superior to; axis: Phase 1 Set)

Figure 5. Phase 2 Performance Comparison

Phase 2 evaluated a system’s performance in using
relevance judgements for retrieval. Stage 2 of our model
uses both positive and negative feedback judgements
for information retrieval. Though many reports have suggested
that negative relevance judgements are useless or of
little help [4, 5, 7], this idea was successfully tested in our
previous work [9] in an experimental environment set up with
the Reuters Corpus Volume 1 (RCV1) [8] and the TREC
filtering track. That work showed that the method significantly
outperformed both state-of-the-art term-based methods
underpinned by Okapi BM25 or Support Vector Machines
and pattern-based methods on precision, recall, and F
measures. In this track, however, our Phase 2 performance
was unsatisfactory, according to the comparison plotted in
Fig. 5. In our investigation, we found that the unsatisfactory
performance was largely caused by the difficulties
encountered in coping with the large testbed, ClueWeb09
Category-B.

Performing content search over ClueWeb09 Category-B
for each topic consumed more time and computational
resources than we could afford, given the track’s
tight schedule and the resources accessible to us. ClueWeb09
Category-B is a huge corpus with 1.5 terabytes of data,
approximately 45,000,000 documents. Pre-processing ClueWeb09
Category-B required a large investment of time
and the use of a high performance computer. Unfortunately, as
this was the first time our lab had worked with the High Performance
Computer (HPC) Centre at QUT, poor collaboration and
a lack of HPC experience cost us a large amount of
time. As a result, time was against us in the experiments.
Consequently, in order to reduce the complexity
as much as possible with only a minimal sacrifice of effectiveness,
we separated the Phase 2 search into two steps, as discussed
in Section 3.4: for each topic, (i) retrieving about 30,000
candidates from ClueWeb09 Category-B based on title
search only; (ii) re-ranking those candidates based on content
and submitting the top 1,000 documents as the final results.
We expected that with 30,000 candidates only a
limited portion of relevant documents would be missed. However, as
shown in Fig. 5, the final result of Phase 2 was disappointing.

The evaluation methods and our Stage 2 method differ
fundamentally in how term weights are evaluated, which may also
have contributed to the disappointing result in Phase 2. eMap and StatAP
are term-based methods that evaluate term weights based
on term distribution in documents. Owing to its large
volume, the ClueWeb09 corpus does not have precise
judgements for the testing set (like the manual judgements in
RCV1 for topics R101-R150 in the TREC 11 Filtering track).
In order to test a relevance feedback method, eMap and
StatAP judged the testing set computationally, based on
term-based algorithms. Our Stage 2 method, however, is
pattern-based: term weights are evaluated based on term
distribution in discovered patterns rather than in
documents (as discussed in Section 3). Therefore, the
performance of our pattern-based method could be
underestimated when it is measured with term-based
computational judgements. This problem actually
occurred in our previous experiments: with
RCV1’s manual judgements (topics R101-R150), the
pattern-based Stage 2 method largely succeeded
and significantly improved the performance of
an information filtering system over Rocchio, BM25,
and SVM [9]; however, the improvement
became relatively slight in experiments with RCV1’s
computational judgements (topics R151-R200). Though
it is still too early to confirm this problem, it will
be interesting to investigate it in our future work
and to test our pattern-based method on more data sets.

5  Conclusion 

This paper investigated a model that was tested in
the TREC 2009 Relevance Feedback track. The model had
two stages, corresponding to the design of the track. Given
a topic, the first stage used a world knowledge
ontology to discover user background knowledge for
query expansion, and then retrieved the pseudo relevance
feedback. From both the positive and negative user
relevance judgements, the second stage mined specific
and general features and used them to benefit
information retrieval. According to the evaluation results, the
model performed well in Stage 1 but unsatisfactorily in Stage
2. The unsatisfactory performance was caused by the
difficulties in coping with the large ClueWeb09 Category-B
corpus.

Our participation in the TREC 2009 Relevance Feedback
track was an innovative exploration of using both
positive and negative feedback judgements in information
retrieval. The participation also demonstrated that
a world knowledge ontology is capable of discovering
user background knowledge and improving information
retrieval. In future work, further investigation and
experiments will be carried out based on full content search
over ClueWeb09 Category-B, rather than the half title-search,
half content-search approach of the reported experiment.